import { Callout } from "nextra/components";

## Prometheus Metrics for Hatchet

<Callout type="warning">Only works with v1 tenants</Callout>

This document provides an overview of the Prometheus metrics exposed by Hatchet, setup instructions for the metrics endpoint, and example PromQL queries to analyze them.

### Setup

To enable and configure the Prometheus metrics endpoint in your Hatchet server, set the following environment variables (bound to Viper keys as shown):

- **`SERVER_PROMETHEUS_ENABLED`** (`prometheus.enabled`)
  - Type: boolean
  - Default: `false`
  - Description: Enables or disables the Prometheus metrics HTTP server.

- **`SERVER_PROMETHEUS_ADDRESS`** (`prometheus.address`)
  - Type: string
  - Default: `":9090"`
  - Description: The network address and port to bind the Prometheus metrics server to.

- **`SERVER_PROMETHEUS_PATH`** (`prometheus.path`)
  - Type: string
  - Default: `"/metrics"`
  - Description: The HTTP path at which metrics will be exposed.

If you have set up a Prometheus instance to scrape Hatchet metrics, you can enable the [tenant API endpoint](/home/prometheus-metrics) by setting the following variables:

- **`SERVER_PROMETHEUS_SERVER_URL`** (`prometheus.prometheusServerURL`)
  - Type: string
  - Description: The Prometheus server URL.

- **`SERVER_PROMETHEUS_SERVER_USERNAME`** (`prometheus.prometheusServerUsername`)
  - Type: string
  - Description: The username to access the Prometheus instance via HTTP basic auth.

- **`SERVER_PROMETHEUS_SERVER_PASSWORD`** (`prometheus.prometheusServerPassword`)
  - Type: string
  - Description: The password to access the Prometheus instance via HTTP basic auth.

**Example environment setup:**

```bash
export SERVER_PROMETHEUS_ENABLED=true
export SERVER_PROMETHEUS_ADDRESS=":9999"
export SERVER_PROMETHEUS_PATH="/custom-metrics"
```

Restart your Hatchet server after setting these variables to apply the changes.

---

### Global Metrics

| Metric Name                          | Type      | Description                                                                       |
| ------------------------------------ | --------- | --------------------------------------------------------------------------------- |
| `hatchet_queue_invocations_total`    | Counter   | The total number of invocations of the queuer function                            |
| `hatchet_created_tasks_total`        | Counter   | The total number of tasks created                                                 |
| `hatchet_retried_tasks_total`        | Counter   | The total number of tasks retried                                                 |
| `hatchet_succeeded_tasks_total`      | Counter   | The total number of tasks that succeeded                                          |
| `hatchet_failed_tasks_total`         | Counter   | The total number of tasks that failed (in a final state, not including retries)   |
| `hatchet_skipped_tasks_total`        | Counter   | The total number of tasks that were skipped                                       |
| `hatchet_cancelled_tasks_total`      | Counter   | The total number of tasks cancelled                                               |
| `hatchet_assigned_tasks_total`       | Counter   | The total number of tasks assigned to a worker                                    |
| `hatchet_scheduling_timed_out_total` | Counter   | The total number of tasks that timed out while waiting to be scheduled            |
| `hatchet_rate_limited_total`         | Counter   | The total number of tasks that were rate limited                                  |
| `hatchet_queued_to_assigned_total`   | Counter   | The total number of unique tasks that were queued and later assigned to a worker  |
| `hatchet_queued_to_assigned_seconds` | Histogram | Buckets of time (in seconds) spent in the queue before being assigned to a worker |
| `hatchet_reassigned_tasks_total`     | Counter   | The total number of tasks that were reassigned to a worker                        |

#### Example PromQL Queries

##### 1. Rate of calls to the queuer method

```promql
rate(hatchet_queue_invocations_total[5m])
```

##### 2. Average queue time in milliseconds

```promql
# Calculates average queue time over the past 5 minutes, converted to ms
rate(hatchet_queued_to_assigned_seconds_sum[5m])
  / rate(hatchet_queued_to_assigned_seconds_count[5m])
  * 1e3
```

##### 3. Success and failure rates

```promql
rate(hatchet_succeeded_tasks_total[5m])
rate(hatchet_failed_tasks_total[5m])
```

##### 4. Queue time distribution (histogram)

```promql
sum by (le) (
  rate(hatchet_queued_to_assigned_seconds_bucket[5m])
)
```

##### 5. Rate of tasks created vs. retried

```promql
rate(hatchet_created_tasks_total[5m])
rate(hatchet_retried_tasks_total[5m])
```

##### 6. Task Assignment Rate

```promql
rate(hatchet_assigned_tasks_total[5m])
```

##### 7. Scheduling Timeout Rate

```promql
rate(hatchet_scheduling_timed_out_total[5m])
```

##### 8. Rate Limiting Impact

```promql
rate(hatchet_rate_limited_total[5m])
```

##### 9. Task Completion Ratio (Success vs Total)

```promql
rate(hatchet_succeeded_tasks_total[5m])
/
(rate(hatchet_succeeded_tasks_total[5m]) + rate(hatchet_failed_tasks_total[5m]))
```

##### 10. Task Cancellation Rate

```promql
rate(hatchet_cancelled_tasks_total[5m])
```

##### 11. Task Skip Rate

```promql
rate(hatchet_skipped_tasks_total[5m])
```

##### 12. Queue Processing Efficiency (Assigned vs Created)

```promql
rate(hatchet_assigned_tasks_total[5m]) / rate(hatchet_created_tasks_total[5m])
```

##### 13. Task Reassignment Rate

```promql
rate(hatchet_reassigned_tasks_total[5m])
```

### Tenant Metrics

| Metric Name                                      | Type      | Description                                                                          |
| ------------------------------------------------ | --------- | ------------------------------------------------------------------------------------ |
| `hatchet_tenant_workflow_duration_milliseconds`  | Histogram | Duration of workflow execution in milliseconds (DAGs and single tasks)               |
| `hatchet_tenant_queue_invocations_total`         | Counter   | The total number of invocations of the queuer function                               |
| `hatchet_tenant_created_tasks_total`             | Counter   | The total number of tasks created                                                    |
| `hatchet_tenant_retried_tasks_total`             | Counter   | The total number of tasks retried                                                    |
| `hatchet_tenant_succeeded_tasks_total`           | Counter   | The total number of tasks that succeeded                                             |
| `hatchet_tenant_failed_tasks_total`              | Counter   | The total number of tasks that failed (in a final state, not including retries)      |
| `hatchet_tenant_skipped_tasks_total`             | Counter   | The total number of tasks that were skipped                                          |
| `hatchet_tenant_cancelled_tasks_total`           | Counter   | The total number of tasks cancelled                                                  |
| `hatchet_tenant_assigned_tasks`                  | Counter   | The total number of tasks assigned to a worker                                       |
| `hatchet_tenant_scheduling_timed_out`            | Counter   | The total number of tasks that timed out while waiting to be scheduled               |
| `hatchet_tenant_rate_limited`                    | Counter   | The total number of tasks that were rate limited                                     |
| `hatchet_tenant_queued_to_assigned`              | Counter   | The total number of unique tasks that were queued and later got assigned to a worker |
| `hatchet_tenant_queued_to_assigned_time_seconds` | Histogram | Buckets of time in seconds spent in the queue before being assigned to a worker      |
| `hatchet_tenant_reassigned_tasks`                | Counter   | The total number of tasks that were reassigned to a worker                           |
| `hatchet_tenant_used_worker_slots`               | Gauge     | The current number of worker slots being used                                        |
| `hatchet_tenant_available_worker_slots`          | Gauge     | The current number of worker slots available (free)                                  |
| `hatchet_tenant_worker_slots`                    | Gauge     | The total number of worker slots (free + used)                                       |

#### Example PromQL Queries

##### 1. Workflow Duration by Tenant and Status

```promql
rate(hatchet_tenant_workflow_duration_milliseconds_sum[5m])
by (tenant_id, workflow_name, status)
/
rate(hatchet_tenant_workflow_duration_milliseconds_count[5m])
by (tenant_id, workflow_name, status)
```

##### 2. Tenant Queue Performance (95th percentile)

```promql
histogram_quantile(0.95,
  rate(hatchet_tenant_queued_to_assigned_time_seconds_bucket[5m])
) by (tenant_id)
```

##### 3. Tenant Error Rate by Workflow

```promql
rate(hatchet_tenant_failed_tasks_total[5m]) by (tenant_id)
/
rate(hatchet_tenant_created_tasks_total[5m]) by (tenant_id)
```

##### 4. Tenant Task Throughput

```promql
rate(hatchet_tenant_succeeded_tasks_total[5m]) by (tenant_id)
```

##### 5. Tenant Retry Rate

```promql
rate(hatchet_tenant_retried_tasks_total[5m]) by (tenant_id)
/
rate(hatchet_tenant_created_tasks_total[5m]) by (tenant_id)
```

##### 6. Workflow Duration Distribution by Tenant

```promql
sum by (tenant_id, le) (
  rate(hatchet_tenant_workflow_duration_milliseconds_bucket[5m])
)
```

##### 7. Tenant Rate Limiting Impact

```promql
rate(hatchet_tenant_rate_limited[5m]) by (tenant_id)
```

##### 8. Per-Tenant Queue Utilization

```promql
rate(hatchet_tenant_queue_invocations_total[5m]) by (tenant_id)
```

##### 9. Tenant Scheduling Timeouts

```promql
rate(hatchet_tenant_scheduling_timed_out[5m]) by (tenant_id)
```

##### 10. Tenant Task Assignment Success Rate

```promql
rate(hatchet_tenant_assigned_tasks[5m]) by (tenant_id)
/
rate(hatchet_tenant_created_tasks_total[5m]) by (tenant_id)
```

##### 11. Tenant Task Reassignment Rate

```promql
rate(hatchet_tenant_reassigned_tasks[5m]) by (tenant_id)
```

### Cross-Tenant Analysis

#### Example PromQL Queries

##### 1. Top 5 Tenants by Task Volume

```promql
topk(5,
  sum by (tenant_id) (
    rate(hatchet_tenant_created_tasks_total[1h])
  )
)
```

##### 2. Slowest Workflows Across All Tenants

```promql
topk(10,
  rate(hatchet_tenant_workflow_duration_milliseconds_sum[5m])
  /
  rate(hatchet_tenant_workflow_duration_milliseconds_count[5m])
) by (tenant_id, workflow_name)
```

##### 3. Tenant Resource Consumption Comparison

```promql
sum by (tenant_id) (
  rate(hatchet_tenant_workflow_duration_milliseconds_sum[1h])
)
/ 1000 / 60  # Convert to minutes
```

### Integration with Prometheus

This endpoint can be used to configure Prometheus to scrape tenant-specific metrics:

```yaml
scrape_configs:
  - job_name: "hatchet-tenant-metrics"
    static_configs:
      - targets: ["cloud.onhatchet.run"]
    metrics_path: "/api/v1/tenants/707d0855-80ab-4e1f-a156-f1c4546cbf52/prometheus-metrics"
    scheme: "https"
    authorization:
      credentials: "your-api-token-here"
```

**Note:** Replace `cloud.onhatchet.run` with the URL where your Hatchet instance is hosted.

This provides tenant-isolated metrics that can be scraped directly by Prometheus or consumed by other monitoring tools that support the Prometheus text format.
